This work was created for the Course Project of the Practical Machine Learning course on Coursera (Aug 2014). In this assignment we have to develop a model that predicts the manner in which users performed an exercise. For this task we use the dataset from the project http://groupware.les.inf.puc-rio.br/har, where a group of users was asked to perform barbell lifts correctly and incorrectly in 5 different ways. Our final goal is to build a predictor that can determine which kind of exercise corresponds to new data.
In order to execute the R code, the following libraries must be loaded.
library(caret)
library(ggplot2)
library(corrgram)
library(gridExtra)
library(randomForest)
We start by reading the movement training and predicting (used for the prediction submission) datasets.
## Read training data.
training.csv <- read.csv('pml-training.csv', header=TRUE)
predicting <- read.csv('pml-testing.csv', header=TRUE)
Let’s filter the training data, keeping the rows related to user movement and discarding the rows that summarize each exercise time window.
# Filter out the exercise-window summary rows.
training.csv <- training.csv[training.csv$new_window =='no',]
Now we partition the data into a training set for building the model and a testing set for validation. We use 60% of the data for training and 40% for testing.
## Create data partition
inTrain <- createDataPartition(training.csv$classe, p = 0.60)[[1]]
training <- training.csv[inTrain,]
testing <- training.csv[-inTrain,]
Next we clean the data in order to apply exploratory data analysis and select the variables that will be used as predictors. We first create the complete list of available predictor candidates (obtained from the predicting data frame) as follows:
# Candidate columns for building predictions.
colnames.predicting <- c(
    'roll_belt', 'pitch_belt', 'yaw_belt', 'total_accel_belt',
    'gyros_belt_x', 'gyros_belt_y', 'gyros_belt_z',
    'accel_belt_x', 'accel_belt_y', 'accel_belt_z',
    'magnet_belt_x', 'magnet_belt_y', 'magnet_belt_z',
    'roll_arm', 'pitch_arm', 'yaw_arm', 'total_accel_arm',
    'gyros_arm_x', 'gyros_arm_y', 'gyros_arm_z',
    'accel_arm_x', 'accel_arm_y', 'accel_arm_z',
    'magnet_arm_x', 'magnet_arm_y', 'magnet_arm_z',
    'roll_dumbbell', 'pitch_dumbbell', 'yaw_dumbbell',
    'gyros_dumbbell_x', 'gyros_dumbbell_y', 'gyros_dumbbell_z',
    'accel_dumbbell_x', 'accel_dumbbell_y', 'accel_dumbbell_z',
    'magnet_dumbbell_x', 'magnet_dumbbell_y', 'magnet_dumbbell_z',
    'roll_forearm', 'pitch_forearm', 'yaw_forearm', 'total_accel_forearm',
    'gyros_forearm_x', 'gyros_forearm_y', 'gyros_forearm_z',
    'accel_forearm_x', 'accel_forearm_y', 'accel_forearm_z')
Finally, we clean the data using this list of columns.
# Create training and testing clean data frames.
testing.clean <- testing[, c('classe', colnames.predicting)]
training.clean <- training[, c('classe', colnames.predicting)]
In order to select the predictors, we examine the following plot of the correlation matrix.
## Plot the correlation diagram.
corrgram(training.clean, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Complete variables correlation matrix")
In this figure we can see which variables are highly correlated, because they appear with strongly saturated colours (red for negative and blue for positive correlation values), so we can keep some of these variables and discard others.
After exploring this diagram we keep the variables whose pairwise correlations are below 50% (negative or positive), which leads to the following list of 21 predictor candidates.
# Candidate columns for prediction, with correlated predictors removed.
colnames.predicting <- c(
    'roll_belt', 'pitch_belt',
    'gyros_belt_x', 'gyros_belt_y', 'gyros_belt_z',
    'magnet_belt_y',
    'roll_arm', 'pitch_arm', 'yaw_arm', 'total_accel_arm',
    'gyros_arm_x', 'accel_arm_x',
    'roll_dumbbell', 'pitch_dumbbell',
    'gyros_dumbbell_x', 'magnet_dumbbell_z',
    'pitch_forearm', 'yaw_forearm', 'total_accel_forearm',
    'gyros_forearm_x', 'accel_forearm_y')
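As a cross-check, the manual selection above could also be sketched programmatically with caret's findCorrelation, which suggests columns to drop for a given correlation cutoff. This is only a sketch, and it assumes the training.clean data frame built from the first, complete candidate list is still in scope:

```r
# Sketch: automated alternative to the manual correlation-based selection.
# Assumes 'training.clean' still holds the full candidate column list.
library(caret)

cor.matrix <- cor(training.clean[, colnames(training.clean) != 'classe'])
high.cor <- findCorrelation(cor.matrix, cutoff = 0.50)
# Columns suggested for removal because of correlations above the cutoff.
colnames(cor.matrix)[high.cor]
```

The suggested removals may differ slightly from the manual list, since findCorrelation uses a greedy heuristic based on mean absolute correlation.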
Again let’s clean the data using the definitive list of candidates.
# Create training and testing clean data frames.
testing.clean <- testing[, c('classe', colnames.predicting)]
training.clean <- training[, c('classe', colnames.predicting)]
This predictor selection yields the following correlation matrix diagram, where all pairwise correlations between variables are below 50%.
# Plot correlation matrix
corrgram(training.clean, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Predictors correlation matrix")
Before continuing with the model creation, we check that none of the selected predictors has near-zero variance.
nearZeroVar(training.clean, saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## classe 1.472 0.04336 FALSE FALSE
## roll_belt 1.074 8.70621 FALSE FALSE
## pitch_belt 1.052 13.89178 FALSE FALSE
## gyros_belt_x 1.056 1.11863 FALSE FALSE
## gyros_belt_y 1.167 0.58099 FALSE FALSE
## gyros_belt_z 1.088 1.41346 FALSE FALSE
## magnet_belt_y 1.082 2.42803 FALSE FALSE
## roll_arm 50.575 19.70170 FALSE FALSE
## pitch_arm 88.000 22.70205 FALSE FALSE
## yaw_arm 40.460 21.59209 FALSE FALSE
## total_accel_arm 1.002 0.55498 FALSE FALSE
## gyros_arm_x 1.019 5.40236 FALSE FALSE
## accel_arm_x 1.020 6.51231 FALSE FALSE
## roll_dumbbell 1.013 86.96670 FALSE FALSE
## pitch_dumbbell 2.364 84.98092 FALSE FALSE
## gyros_dumbbell_x 1.020 1.95976 FALSE FALSE
## magnet_dumbbell_z 1.000 5.68852 FALSE FALSE
## pitch_forearm 69.152 21.36663 FALSE FALSE
## yaw_forearm 14.812 14.29934 FALSE FALSE
## total_accel_forearm 1.070 0.58099 FALSE FALSE
## gyros_forearm_x 1.016 2.40201 FALSE FALSE
## accel_forearm_y 1.033 8.39403 FALSE FALSE
As the plot and this near-zero-variance result show, we have 21 predictor variables that are only weakly correlated with one another, and none of them has near-zero variance, so we will use this list of variables for training the model.
In this final section we create a model that predicts the type of exercise from the user movement data. We build it using the random forest algorithm with the default package options.
Let’s create the random forest model.
# Create the prediction model using random forest.
set.seed(1251)
prediction.model <- randomForest(classe ~ ., importance = TRUE, data = training.clean)
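As an alternative, the same model could be fitted through caret's train interface with explicit k-fold cross validation. This is only a sketch (not run here, since it is considerably slower than calling randomForest directly), and the number of folds is an arbitrary choice:

```r
# Sketch: random forest via caret with 5-fold cross validation.
library(caret)

set.seed(1251)
cv.control <- trainControl(method = "cv", number = 5)
cv.model <- train(classe ~ ., data = training.clean,
                  method = "rf", trControl = cv.control)
# cv.model$results reports the cross-validated accuracy per mtry value.
```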
We obtain the following results on the training data:
print(prediction.model)
##
## Call:
## randomForest(formula = classe ~ ., data = training.clean, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 1.41%
## Confusion matrix:
## A B C D E class.error
## A 3268 7 2 2 4 0.004569
## B 19 2184 24 4 0 0.021067
## C 0 14 1983 13 2 0.014414
## D 0 3 48 1836 2 0.028057
## E 1 2 8 8 2098 0.008975
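Since the model was fitted with importance = TRUE, we can also inspect which predictors contribute most to the classification. A sketch, assuming prediction.model from the chunk above:

```r
# Sketch: inspect the variable importance computed by the random forest.
imp <- importance(prediction.model, type = 1)  # mean decrease in accuracy
head(imp[order(imp, decreasing = TRUE), , drop = FALSE], 10)

# Graphical version of the same information.
varImpPlot(prediction.model, main = "Variable importance")
```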
Now we validate the model against the held-out testing data and check the results of its predictions. The confusion matrix and overall statistics obtained from this operation are:
# Test the results against the testing data.
prediction.testing <- predict(prediction.model, newdata=testing.clean)
confusionMatrix(testing.clean$classe, prediction.testing)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2178 6 2 1 1
## B 11 1458 18 0 0
## C 2 12 1315 9 2
## D 0 2 22 1234 0
## E 0 0 2 5 1404
##
## Overall Statistics
##
## Accuracy : 0.988
## 95% CI : (0.985, 0.99)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.984
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.994 0.986 0.968 0.988 0.998
## Specificity 0.998 0.995 0.996 0.996 0.999
## Pos Pred Value 0.995 0.980 0.981 0.981 0.995
## Neg Pred Value 0.998 0.997 0.993 0.998 1.000
## Prevalence 0.285 0.192 0.177 0.163 0.183
## Detection Rate 0.283 0.190 0.171 0.161 0.183
## Detection Prevalence 0.285 0.194 0.174 0.164 0.184
## Balanced Accuracy 0.996 0.991 0.982 0.992 0.998
We can see that the model performs a very acceptable classification over the test data, with a high overall accuracy of 98.8% (every per-class sensitivity and specificity is at least 96.8%), a very narrow 95% confidence interval (0.985, 0.990) and a small p-value (<2e-16).
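From these hold-out predictions we can also estimate the expected out-of-sample error directly. A sketch, assuming the objects from the previous chunks are in scope:

```r
# Sketch: estimate the out-of-sample error from the hold-out set.
accuracy <- mean(prediction.testing == testing.clean$classe)
out.of.sample.error <- 1 - accuracy
# Expected to be close to 1 - 0.988, in line with the 1.41% OOB estimate.
out.of.sample.error
```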
Finally, we obtain the predictions required for the assignment submission.
# Predict on the submission data.
predicting.testing <- predict(prediction.model, newdata=predicting)
This leads to a final result of 100% correct predictions in the Coursera course submission.
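For completeness, the predictions can be written to the one-answer-per-file format expected by the submission page. This sketch follows the helper script suggested in the course instructions (reproduced from memory, so treat the details as an assumption):

```r
# Sketch: write each predicted answer to its own text file for submission.
pml_write_files <- function(x) {
    for (i in seq_along(x)) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE,
                    row.names = FALSE, col.names = FALSE)
    }
}
pml_write_files(predicting.testing)
```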